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Abstract 

Classical regression methods treat covariates as a vector and estimate a corresponding vector 
of regression coefficients. Modern applications in medical imaging generate covariates of more 
complex form such as multidimensional arrays (tensors). Traditional statistical and computa- 
tional methods are proving insufficient for analysis of these high-throughput data due to their 
ultrahigh dimensionality as well as complex structure. In this article, we propose a new family of 
tensor regression models that efficiently exploit the special structure of tensor covariates. Under 
this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient 
estimation and prediction. A fast and highly scalable estimation algorithm is proposed for max- 
imum likelihood estimation and its associated asymptotic properties are studied. Effectiveness 
of the new methods is demonstrated on both synthetic and real MRI imaging data. 

Key Words: Brain imaging; dimension reduction; generalized linear model (GLM); magnetic 
resonance imaging (MRI); multidimensional array; tensor regression. 



1 Introduction 



Understanding the structure and function of the human brains and their connection with neuropsy- 



chiatric and neurodegenerative disorders is one of the most intriguing scientific questions (Towle 



eTaTI [T9931 [Niedermeyer and da SiTva] [2004t [Buzsakij [20061 Fassj [20081 |Lindquist| [20081 |Lazar| 



2008 Friston, 2009 Kang et al. , 2012). Rapidly advancing medical imaging technologies provide 



powerful tools to help address this question. There are a variety of imaging modalities, including 
anatomical magnetic resonance imaging (MRI), functional magnetic resonance imaging (fMRI), 
electroencephalography (EEG), diffusion tensor imaging (DTI), and positron emission tomography 



(PET), among others. The goal is to understand neural development of both normal brains and 
brains with mental disorders through one or several imaging modalities. The size and complexity of 
medical imaging data, however, are posing unprecedented demands for new statistical methods and 
theories. In this article, we focus on a family of problems using imaging data to predict cognitive 
outcome, to classify disease status, and to identify brain regions associated with clinical response. 



which have received increasing interest in recent years ( Lindquist 2008 ; Lazar , 2008 Martino et al 



2008 Friston, 2009 Ryah et al., 2010: Hinrichs et al., 2009). The problems can be formulated in 



a regression setup by treating clinical outcome as response, and treating images, which are in the 
form of multi- dimensional array, as covariates. However, most classical regression methods take 
vectors as covariates. Naively turning an image array into a vector is clearly an unsatisfactory so- 
lution. For instance, with a typical anatomical MRI image of size 256-by-256-by-256, it implicitly 
requires 256^ = 16, 777, 216 regression parameters. Both computability and theoretical guarantee of 
the classical regression models are compromised by this ultra-high dimensionality. More seriously, 
vectorizing an array destroys the inherent spatial structure of the image that possesses wealth of 
information. 

In the literature, there have been roughly three categories of solutions to establishing association 
between matrix/array covariates and clinical outcome. The first is the voxel-based methods, which 
take the image data at each voxel as responses and clinical variables such as age and gender as pre- 
dictors, and then generate a statistical parametric map of test statistics or p- values across all voxels 



(Lazar, 2008 Worsley et al. 2004). A major drawback is that they treat all voxels as independent 



units, since the fit is at individual voxel level, and thus ignore the fact that voxels are spatially 



correlated (Li et al. , 2011 Yue et al. , 2010; Polzehl et al. , 2010). The second type of solutions 



adopts the functional data point of view by taking a one-dimensional function as predictor. Fitting 
such models commonly involves representing functions as a linear combination of basis functions. 



which are either pre-specified or obtained from principal component decompositions (Ramsay and 



Silverman, 2005). Reiss and Ogden (2010) notably extended this idea to a functional regression 



model with two-dimensional images as predictors. Extending their method to 3D and higher di- 
mensional images, however, is far from trivial and requires substantial research, due to the large 
number of parameters and multi-collinearity among imaging measures. The third category employs 
a two-stage strategy, first carrying out a dimension reduction step, often by principal component 



analysis (PC A), then fitting a model based on the reduced-dimensional principal components (Caffo 



et al. , 2010). This strategy is intuitive and easy to implement. However, it is well known that PCA 



is an unsupervised dimension reduction technique and the extracted principal components can be 
irrelevant to the response. Moreover, theoretical properties of such two-stage solutions are usually 
intractable and no theoretical results are currently available. 

In this article, we propose a new class of regression models for array- valued covariates. Exploit- 
ing the array structure in imaging data, the new method substantially reduces the dimensionality 
of imaging data, which leads to efficient estimation and prediction. The method works for general 
array-valued covariates and/or any combination of them, and thus it is applicable to a variety 
of imaging modalities, e.g., EEG, MRI and fMRI. It is embedded in a generalized linear model 
(GLM) framework, so it works for both continuous and discrete responses. Within the proposed 
model framework, we develop both a highly scalable maximum likelihood estimation algorithm as 
well as statistical inferential tools. Regularized tensor regression is also developed to identify re- 
gions of interest in brains that are relevant to a particular response. This region selection problem 
corresponds to variable selection in the usual vector-valued regression. 

The contributions of this article are two-fold. First, from an image analysis point of view, our 
proposal timely responds to a number of growing needs of neuroimaging analysis. It also provides 
a systematic solution for the integrative analysis of multi-modality imaging data and imaging 



genetics data (Friston, 2009; Casey et al. , 2010). Second, from a statistical methodology point of 
view, our proposal provides a novel and broad framework for regression with array covariates. A 
large number of models and extensions are potential outcomes within this framework. Although 



there has been imaging studies utilizing tensor structure (Li et al. , 2005 Park and Savvides, 2007), 
our proposal, to the best of our knowledge, is the first work that integrates tensor decomposition 
within a statistical regression {supervised learning) paradigm. Our work can be viewed as a logic 
extension from the classical vector-valued covariate regression to functional covariate regression 
and then to array-valued covariate regression. 

The rest of the article is organized as follows. Section ^ begins with a review of matrix/array 
properties, and then develops the tensor regression models. Section [3] presents an efficient algorithm 
for maximum likelihood estimation. Section |4] provides theoretical results such as identifiability, 
consistency, and asymptotic normality. Section [5] discusses regularization including region selection. 
Section [6] presents numerical results. Section [7] concludes with a discussion of future extensions. 
Technical proofs are delegated to the Appendix. 



2 Model 

2.1 Preliminaries 

Multidimensional array, also called tensor, plays a central role in our approach and we start with 
a brief summary of notation and a few results for matrix/array operations. Extensive references 



can be found in the text (Magnus and Neudecker, 1999) for matrix calculus and the survey paper 



(Kolda and Bader, 2009) for tensors. In this article we use the terms multidimensional array and 



tensor interchangeably. 

First we review two matrix products. Given two matrices A 



[ai...a„] G IR" 



and 



B 



[bi . . . bq] G IR^^^, the Kronecker product is the mp-hy-nq matrix A B = [ai ® B ai 



B ... ar, 



B]. If A and B have the same number of columns n = q, then the Khatri-Rao prod- 



uct (Rao and Mitra, 1971) is defined as the mp-hy-n columnwise Kronecker product A Q B = 
[ ai (g) 6i a2 (8) ^2 • • • an bn ] . li n = q = I, then A Q B = A (>i) B. Next, we introduce 
some useful operations that transform a tensor into a matrix/vector. The vec(S) operator stacks 
the entries of a D-dimensional tensor B £ ^pi^ — ><pd j^^q ^ column vector. Specifically, an entry 
hi^,,,ij^ maps to the j-th entry of vecB, in which j = 1 + ^^^i{id — ^)Wd/'=iPd' ■ For instance, 
when D = 2, the matrix entry Xj^jj maps to position j = 1 + ii — 1 + {12 — l)pi = ii + (^2 — l)pi, 
which is consistent with the more familiar vec operation on a matrix. The mode-d matricization, 
B(^^-^, maps a tensor B into a pd x Yl^i^^Pd' matrix such that the (ii, . . . , id) element of the array 
B maps to the {id,j) element of the matrix -B(rf), where j = 1 + J2d'^di'^d' — ^)Ild"<d',d"^dPd"- 
With d = 1, we observe that vecB is the same as vectorizing the mode-1 matricization Bny 
The mode-(d,d' ) matricization B(^dcLi) G '^PiPd'^^UdH^dA' PdH jg defined in a similar fashion (Kolda 



2006 ) . We also introduce an operator that turns vectors into an array. Specifically, an outer prod- 
uct, 61 o 62 o • • • o bi), of D vectors bd G IR^'', d = 1, . . . , D, is a pi x • • • x p/5 array with entries 
(61 o 62 o • • • o bD)n-io = ridLi ^did- 

Next we introduce a concept that plays a central role in our proposed tensor regression in 



Section 2.3 We say an array B G IRJ'j^'"^pb admits a rank-R decomposition if 

R 

EM' 



B 



^^'■^o.-.o/^W, 



(1) 



r=l 



iir) 



where l3^ G IR^"^, d = 1, . . . ,D,r = 1, . . . ,R, are all column vectors, and B cannot be written as 
a sum of less than R outer products. For convenience, the decomposition is often represented by a 
shorthand, B = {Bi, . . . , Bd}, where Bd = [/3^ 



^d ' • • • ' ^^d 



/3^^)] (zMP^xR^ d = l,...,D (Kolda 



2006 



Kolda and Bader, 2009). The following well-known result relates the mode-d matricization and 



the vec operator of an array to its rank-i? decomposition. The proof is given in the Appendix for 
completeness. 

Lemma 1. If a tensor B G ^pix--^pd admits a rank-R decomposition |il), then 

B^d) = Bd{BD • • • Bd+i Bd-i • • • Bif and vecB = {Bd Q ■ ■ ■ Q Bi)1r. 

Throughout the article, we adopt the following notations, y is a univariate response variable, 
Z G IR^" denotes a po-dimensional vector of covariates, such as age and sex, and X G ]i^Pi^---^PD jg g, 
D- dimensional array-valued predictor. For instance, for MRI, D = 3, representing the 3D structure 
of an image, whereas for fMRI, Z? = 4, with an additional time dimension. The lower-case triplets 
{yi,Xi,Zi), i = 1, . . . ,n, denote the independent, observed sample instances of {Y, X, Z). 

2.2 Motivation and Basic Model 

To motivate our model, we first start with a vector- valued X and absorb Z into X . In the classical 



GLM (McCullagh and Nelder, 1983) setting, Y belongs to an exponential family with probability 



mass function or density. 



p{y\e, ct>) = exp <! ^-^^j^ + c{y, ct>) \ (2) 



where 9 and (p > {) denote the natural and dispersion parameters. The classical GLM relates 
a vector-valued X G IR^ to the mean // = E{Y\X) via g{^) = r] = a + (3^X, where g{-) is a 
strictly increasing link function, and r/ denotes the linear systematic part with intercept a and the 
coefficient vector /3 G IR^. 

Next, for a matrix- valued covariate X G IRP^^p^ ^d — 2)^ it is intuitive to consider a GLM 
model with the systematic part given by 

5(/x) = a + (31X(32, 

where /3^ G IR''' and (32 G IR^'^, respectively. The bilinear form P\X(32 is a natural extension of 
the linear term /3^X in the classical GLM with a vector covariate X. It is interesting to note that. 



this bilinear form was first proposed by Li et al. (2010) in the context of dimension reduction, and 



then employed by Hung and Wang (2011|) in the logistic regression with matrix- valued covariates 



{D = 2). Moreover, note that /3{X/32 = {(32 /3i)"'vec(X). 

Now for a conventional vector- valued covariate Z and a general array- valued X G IRP^^--^Pfl, 
we propose a GLM with the systematic part given by 

5(m) = q + 7^-^ + (/3i) • • • /3i)Vec(X), (3) 



where 'y G IR^" and /3^ £ IR^"* for d = 1, . . . ,D. This is our basic model for regression with array 
covariates. The key advantage of model (pi) is that it dramaticahy and effectively reduces the 
dimensionality of the tensor component, from the order of JldP'^ ^° ^^^^ order of X^^^Pd- Take MRI 
imaging as an example, the size of a typical image is 256^ = 16, 777, 216. If we simply turn X into a 
vector and fit a GLM, this brutal force solution is over 16 million-dimensional, and the computation 
is practically infeasible. By contrast, (pi) turns the problem to be 256+256+256 = 768-dimensional. 
The reduction in dimension, and consequently in computational saving, is substantial. 

A critical question then is whether such a massive reduction in the number of parameters would 
limit the capacity of model (pi) to capture regions of interest with specific shapes. The illustrative 
example in Figure [T] provides some clues. In Figure [l| we present several two-dimensional images 
B G ]R,^^^^^ (shown in the first column), along with the estimated images by model (pi) (in the 
second column labeled by TR(1)). Specifically, we simulated 1,000 univariate responses yi according 
to a normal model with mean /Uj = 7^Zj + {B, Xi), where 7 = I5. The inner product between two 
arrays is defined as {B,X) = {vecB,vecX) = J2i i (^ii...iD^ii---iD- The coefficient array B is 
binary, with the true signal region equal to one and the rest zero. The regular covariate Zi and 
image covariate Xi are randomly generated with all elements being independent standard normals. 
Our goal is to see if model ([s]) can identify the true signal region in B using data {yi, Zi,Xi). Before 
examining the outcome, we make two remarks about this illustration. First, our problem differs 



from the usual edge detection or object recognition in imaging processing (Qiu, 2005, 2007). In 
our setup, all elements of the image X follow the same distribution. The signal region is defined 
through the coefficient image B and needs to be inferred from the association between Y and X 
after adjusting for Z. Second, the classical GLM is difficult to apply in this example if we simply 
treat vec{X) as a covariate vector, since the sample size n = 1,000 is much less than the number 
of parameters p = 5 + 64 x 64 = 4, 101. Back to Figure [T| the second column clearly demonstrates 
the ability of model (pi) in identifying the rectangular (square) type region (parallel to the image 
edges). On the other hand, since the parameter vector /3^ in a rank-1 model is only able to capture 
the accumulative signal along the d-th dimension of the array variate X, it is unsurprising that it 
does not perform well for signals that are far away from rectangle, such as triangle, disk, T-shape 
and butterfly. This motivates us to develop a more flexible tensor regression model in the next 
section. 
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Figure 1: True and recovered image signals by tensor regression. The matrix variate has size 64 
by 64 with entries generated as independent standard normals. The regression coefficient for each 
entry is either (white) or 1 (black). The sample size is 1000. TR(i2) means estimate from the 
rank-i? tensor regression. 



2.3 Tensor Regression Model 

We start with an alternative view of the basic model ([3]), which will lead to its generalization. 
Consider a D-dimensional array variate X G jj^pjx-xpd^ ^j^d a full coefficient array B of same 
size that captures the effects of each array element. Then the most flexible GLM suggests a linear 
systematic part 

g{^i) = a + -f^Z + {B,X). 

The issue with this model is that B has the same number of parameters, 11^=1 Prf' ^^ ^' which is 
ultrahigh dimensional and far exceeds the usual sample size. Then a natural idea is to approximate 
B with less parameters. If B admits a rank-1 decomposition (II|, i.e., B = /J^ o /32 o • • • o (3£), where 
Pd G IR^'', then by Lemma [T| we have 

vecS = vec (/3i o /32 o • • • o /3^) = /3^ © • • • /3^ = /3^ (g) • • • (^ /3^. 

In other words, model (JSJ) is indeed a data-driven model with a rank-1 approximation to the general 
signal array B. This observation motivates us to consider a more flexible tensor regression model. 
Specifically, we propose a family of rank-R generalized linear tensor regression models, in which 
the systematic part of GLM is of the form 

r=l 

= a + -/^Z+{{BDQ---QBi)lR,vecX), (4) 

where B = IBi, . . . , BjjJ = 2Jr=i Pi ° (^2 ° ' ' ' ° I^d '^ i-^-i i^ admits a rank-i? decomposition, 
Bd = [/3i^\ . . . ,/3i^^] G IRP'*''-^, Si50- • -OBi G IROdPrfX-R is the Khatri-Rao product and 1r is the 
vector of R ones. When i? = 1, it reduces to model ([3]). A few remarks on Q are in order. First, 
since our formulation only deals with the linear predictor part of the model, it easily extends to the 



quasi-likelihood models (McCullagh and Nelder 1983) where more general mean- variance relation 
is assumed. Second, for simplicity, we only discuss exponential family with a univariate response. 
Extension to multivariate exponential family, such as multinomial logit model, is straightforward. 
Third, due to the GLM setup (pi), we call Q a generalized linear tensor regression model. However, 
we should bear in mind that the systematic component r/ is a polynomial rather than linear in the 
parameters B^- Finally, the rank-i? tensor decomposition (II]) is called canonical decomposition or 



parallel factors (CANDECOMP/PARAFAC, or CP) in psychometrics ( |Kolda and BaderH2009| ) . In 
that sense, model ([4]) can be viewed as a supervised version of the classical CP decomposition for 
multi-dimensional arrays. 



The number of parameters in model Q is po + -^ Sd Pd, which is still substantially smaller than 
Pq + ndP<i- With such a massive reduction in dimensionality, however, we demonstrate it provides 
a reasonable approximation to many low rank signals. Returning to the previous illustration, in 
Figure [II images TR(i?) are the recovered signals by the rank-i? tensor regression (in third and 
fourth columns). The square signal can be perfectly recovered by a rank-1 model, whereas rank-2 
and 3 regressions show signs of overfitting. The T-shape and cross signals can be perfectly recovered 
by a rank-2 regression. Triangle, disk, and butterfly shapes cannot be exactly recovered by any low 
rank approximations; however, a rank 3 tensor regression already yields a fairly accurate recovery. 
Clearly, the general tensor regression model Q is able to capture significantly more tensor signals 
than the basic model ([3|. 

3 Estimation 



n 



We pursue the maximum likelihood (ML) route for parameter estimation in model Q. Given 
i.i.d. data {{yi, Xi, Zi),i = 1, . . . , n}, the log-likelihood function for ^ is 

where 6i is related to regression parameters (a, 7, Si, ... , Bd) through Q. We propose an efficient 
algorithm for maximizing £(a, 7, Bi, . . . , Bd). A key observation is that although g{fi) in Q is not 
linear in {Bi, . . . ,-Bd) jointly, it is linear in B^ individually. This suggests alternately updating 
(a, 7) and B^, d = 1,...,D, while keeping other components fixed. It yields a so-called block 



relaxation algorithm (de Leeuw 1994 Lange, 2010). An appealing feature of this algorithm is 



that at each iteration, updating a block Bd is simply a classical GLM problem. To see this, when 
updating B^ G IR^''^ , we rewrite the array inner product in (|4|) as 

R 

{J2 /3f ^ o /3f ) o . . . o (3^;;\X) = {Bd, X^d){BD • • • Bd+i B^-i • • • B^)). 

r=l 

Consequently the problem turns into a traditional GLM regression with Rpd parameters, and the 
estimation procedure breaks into a sequence of low dimensional GLM optimizations and is extremely 
easy to implement using ready statistical softwares such as R, S+, SAS, and Matlab. The full 
estimation procedure is summarized in Algorithm [TJ For the Gaussian models, it reduces to the 



alternating least squares (ALS) procedure (de Leeuw et al. , 1976). 



As the block relaxation algorithm monotonically increases the objective function, it is numeri- 
cally stable and the convergence of objective values i{6^^') is guaranteed whenever £{6) is bounded 

9 



Algorithm 1 Block relaxation algorithm for maximizing (pi). 



Initialize: {a^^',^^^') = argmax^-, ^(a,7, 0, . . . , 0), B^ G IR^'^^'^ a random matrix for d = 

1,...,D. 

repeat 

for d = 1, . . . , D do 

end for 

(«(*+!), 7(*+i)) = argmax,^^ ^(a, 7, sf +'\ . . . , sg+')) 

until£(6>(*+i))-£(6>W) <e 

from above. Therefore the stopping rule of Algorithm [T] is well-defined. We denote the algorithmic 
map by M, i.e., M{6^^') = 6^^~^^' , with = (0,7, Bi, . . . , Bo) collecting all parameters. Conver- 
gence properties of Algorithm [T] are summarized in Proposition [T} The proof is relegated to the 
Appendix. 

Proposition 1. Assume (i) the log-likelihood function i{0) is continuous, coercive, i.e., the set 
{6 : £{6) > i{6^ ')} is compact, and bounded above, (ii) the objective function in each block up- 
date of Algorithm [7] is strictly concave, and (Hi) the set of stationary points (modulo scaling and 
permutation indeterminancy) of i{9) are isolated. We have the following results. 

1. (Global Convergence) The sequence 6^ ' = (a'-*-', 7'*-*, B\ , . . . , Bj^) generated by Algorithm\l\ 
converges to a stationary point of l{0). 

2. (Local Convergence) Let 6^°^' = {a^'^',^^°°',B\°°,...,Bj^) be a strict local maximum of 
i{0). The iterates generated by Algorithm^ are locally attracted to 6^°^' for 6^^' sufficiently 
close toe^^\ 

We make a few quick remarks. First, although a stationary point is not guaranteed to be even a 
local maximum (it can be a saddle point), in practice the block relaxation algorithm almost always 
converges to at least a local maximum. In general, the algorithm should be run from multiple 
initializations to locate an excellent local maximum. Second, f.{6) is not required to be jointly 
concave in 0, but only the concavity in the blocks of variables is needed. This condition holds for 
all GLM with canonical link such as linear regression and Poisson regression with exponential link. 

The above algorithm assumes a known rank when estimating B. Estimating an appropriate 
rank for our tensor model Q is of practical importance. It can be formulated as a model selection 
problem, and we adopt the usual model section criterion, e.g., Bayesian information criterion (BIC), 
—2i{6)-\-\og{n)p(,, where Pe is the effective number of parameters for model Ml): Pf, = R{pi-\-p2) — Fl? 



10 



for D = 2, andpe = -R(X^(iPd— -D+1) for D > 2. Returning to the illustrative example in Section 2.2 
we fitted a rank-1, 2 and 3 tensor models, respectively, to various signal shapes. The corresponding 
BIC values are shown in Figure [l} The criterion is seen correctly estimating the rank for square as 
1, and the rank for T and cross as 2. The true ranks for disk, triangle and butterfly are above 3, 
and their BIC values at rank 3 are smallest compared to those at 1 and 2. 

4 Theory 

We study the statistical properties of maximum likelihood estimate (MLE) for the tensor regression 
model defined by ([2]) and Q. For simplicity, we omit the intercept a and the classical covariate 
part 7^.2', though the conclusions generalize to an arbitrary combination of covariates. We adopt 
the usual asymptotic setup with a fixed number of parameters p and a diverging sample size n, 
because this is an important first step toward a comprehensive understanding of the theoretical 
properties of the proposed model. The asymptotics with a diverging p is our future work and is 
pursued elsewhere. 

4.1 Score and Information 

We first derive the score and information for the tensor regression model, which are essential for 
statistical estimation and inference. The following standard calculus notations are used. For a 
scalar function /, V/ is the (column) gradient vector, df = [V/]^ is the differential, and dP f is 
the Hessian matrix. For a multivariate function g : IR^ i— )• IR'^, Dg G IR'^^^ denotes the Jacobian 
matrix holding partial derivatives dgi/dxj. 

We start from the Jacobian and Hessian of the systematic part rj = g(/x) in Q. The proof is 
given in the Appendix. 

Lemma 2. 1. The gradient Vrj{Bi, ... , Bd) G ]R^^d=' ^"^ is 

Vr]{Bi, ...,Bd) = [JiJ2 ■■■ JdV (yecX), 
where J^ G IRnd=j Pd^PdR ^g ^/jg Jacobian 

Jd = DB{Bd) = Ud[{BD • • • Bd+i Ba-i • • • Bi) IpJ (6) 

and Ilrf is the {Y\d=iPd)'by-iY[d=iPii) permutation matrix that reorders vecB^d) to obtain 
vecB, i.e., vecB = UdyecB^^d)- 
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2. The Hessian (Pr]{Bi, ..., Bd) G 11^^^=^ p^x-RELj Pd has entries 
and can be partitioned in D^ blocks as 



(r) 
3d" ' 
Jd=id,Jd'=^d' d"^d,d' 



/ * * * \ 

H21 * * 



: : ■ . * 

\ Hdi Hd2 • • • / 

The block H^d' G IRPrf^^Pd'^ has pdPd'R nonzero elements which can be retrieved from the 
matrix X(^ad'){BD • • • Bd+i Bd-i • • • Bd'+i Bd'-i • • • Si), where X(^m') is the 
mode-{d,d') matricization of X. 

Remark 1: The Hessian d'^r] is highly sparse and structured. An entry in d'^r]{Bi, . . . , Bd) is 
nonzero only if it belongs to different directions d but the same outer product r. 

Let i{Bi, . . . , B£)\y, x) = \i\p{y\x, Si, ... , Bn) be the log-density. Next result derives the score 
function, Hessian, and Fisher information of the tensor regression model. 

Proposition 2. Consider the tensor regression model defined by M) and (Op. 

1. The score function (or score vector) is 

Ve{B,, ...,Bd)= ^^~ ^Y^"^^ [Ji . . . JonyecX) (7) 

with Jd, d = 1, . . . ,D, defined by ^. 

2. The Hessian of the log- density i is 

H{B,, ...,Bd)= - ^^^^([Ji . . . JD]"vecX)([Ji . . . JnVvecXy 

+ k^:^)^!M([j, . . . J^]TvecX)([Ji . . . JoYyecXy 
^iy^JpI]ld^,^B„...,Bn), (8) 

with d^r] defined in Lemma\^ 

3. The Fisher information matrix is 

I{Bi,...,Bd) = E[-H{Bi,...,BD)]=yai[Vi{Bi,...,BD)di{Bu...,BD)] 

= M^[j^...j^]T(vecX)(vecX)^[Ji...Jz)]. (9) 
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Remark 2: For canonical link, 9 = r], 9'{r]) = 1, 9"{r]) = 0, and the second term of Hessian 
vanishes. For the classical GLM with linear systematic part {D = 1), cPrj^Bi, . . . , Bjj) is zero and 
thus the third term of Hessian vanishes. For the classical GLM {D = 1) with canonical link, both 
the second and third terms of the Hessian vanish and thus the Hessian is non-stochastic, coinciding 
with the information matrix. 

4.2 Identifiability 

Before studying asymptotic property, we need to deal with the identifiability issue. The param- 
eterization in the tensor model is nonidentifiable due to two complications. Consider a rank-i? 
decomposition of an array, B = ^Bi , . . . , B^J . The first complication is the indeterminacy of B 
due to scaling and permutation: 

- scaling: B = iBiAi, . . . ,Bj;)A£)} for any diagonal matrices A^ = diag{Xdi, ■ ■ ■ ,XdR), d = 
1, . . . , D, such that Hd ^dr = 1 for r = 1, . . . , -R. 

- permutation: B = \BiYi, . . . , S/)!!]! for any R-hy-R permutation matrix 11. 

For the matrix case (D = 2), a further complication is the nonsingular transformation indeter- 
minancy: B1B2 = -BiOO^^BJ ^^ ^^Y R-i>y-R nonsingular matrix O. Note the scaling and 
permutation indeterminancy is subsumed in the nonsingular transformation indeterminancy. The 
singular value decomposition (SVD) of a matrix is unique because it imposes orthonormality con- 
straint on the columns of the factor matrices. 

To deal with this complication, it is necessary to adopt a specific constrained parameterization 
to fix the scaling and permutation indeterminacy. For D > 2, we need to put {D — 1)R restrictions 
on the parameters B and apparently there is an infinite number of ways to do this. In this paper 

(r) 

we adopt the following convention. Bi, . . . , B^^i are scaled such that (3^^ = 1, i.e., the first rows 
are ones. This in turn determines entries in the first row of B^ and fixes scaling indeterminacy. 
To fix the permutation indeterminancy, we assume that the first row entries of B^ are distinct and 
arranged in descending order /3]-,j > • • • > l3}j-{ . The resulting parameter space is 

B = {(Bi,. . .,Bd) :/3iJ = 1, for d = 1, ... ,Z?,r = 1, ... ,i? and /3g; >••• > p'j^^}, 

which is open and convex. The formulae for score, Hessian and information in Proposition [2] require 
changes accordingly, i.e., the entries in the first rows of B^, d = 1, . . . ,D — 1, are fixed at ones 
and their corresponding entries, rows and columns in score, Hessian and information need to be 
deleted. Treatment for the D = 2 case is similar and omitted for brevity. We emphasize that our 
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choice of the restricted space B is arbitrary and exclude many arrays that might be of interest, e.g., 
arrays with any entries in the first rows of B^, d = 1, . . . ,D — 1, equal to zeros or with ties in the 
first row of B^. However the set of such exceptional arrays has Lebesgue measure zero. In specific 
applications, subject knowledge may suggest alternative restrictions on the parameters. 

The second complication comes from possible non-uniqueness of decomposition when D > 2 
even after adjusting scaling and permutation indeterminacy. The next proposition collects some 
recent results that give easy-to-check conditions for the uniqueness (up to scaling and permutation) 
of decomposition. The first two are useful for checking uniqueness of a given tensor, while the latter 
two give general conditions for uniqueness almost everywhere in the D = 3 or 4 case. 

Proposition 3. Suppose that a D-dimensional array B G ]RPjx--xpb j^as rank R. 

1. (Sufficiency) l^Sidiropoulos and Bro , \2000 ) The decomposition ^ is unique up to scaling and 



^D 



permutation if Yld=i ^Bd — 2-R ~^ (D — 1), where k^ is the k-rank of a matrix A, i.e., the 
maximum value k such that any k columns are linearly independent. 



2. (Necessity) {Liu and Sidiropoulos , 2001) If the decomposition Tw is unique up to scaling and 
permutation, then VQ.md=i^...^D rank(Si • • • Bd-i Bd+\ • • • Bd) = R, which in turn 
implies that Tsim.d=i^...^D iWd'^d^'^^'^^i-^d')) > ^• 



3. jde Lathauwerj \2M§ When D = 3, R < ps and R{R - 1) < pi{pi - l)p2{P2 - l)/2, the 
decomposition (Wp is unique for almost all such tensors except on a set of Lebesgue measure 
zero. 



4- (de Lathauwer . 2006) When D = 4, R < p4 and R{R — 1) < PiP2P3{'ipiP2P3 — P1P2 — P1P3 — 
P2P3 — Pi —P2 — P3 + 3)/4, the decomposition (Mj is unique for almost all such tensors except 
on a set of Lebesgue measure zero. 

Next we give a sufficient and necessary condition for local identifiability. The proof follows 



from a classical result (Rothenberg, 1971 1 that relates local identifiability to the Fisher information 
matrix. 

Proposition 4 (Identifiability). Given iid data points {{yi,Xi),i = l,...,n} from the tensor re- 
gression model. Let Bq & B be a parameter point and assume there exists an open neighborhood 
of Bq in which the information matrix has a constant rank. Then Bq is locally identifiable up to 
permutation if and only if 



HBo) 



■ JdY 



2_] ^ — (vec ajj) (vec XiY 



i=l 



fyj 



[ Ji . . . Jd] 
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is nonsingular. 

Remark 3.1: Proposition [4] explains the merit of tensor regression from another angle. For 
identifiability, the classical linear regression requires vecaij S IR^^''^'*, z = 1, . . . ,n, to be linearly 
independent in order to estimate all parameters, which requires a sample size n > YldPd- "^^^ 
more parsimonious tensor regression only requires linearly independence of the "collapsed" vec- 
tors [Ji . . . j£)YvecXi G ]R,-^(z^dPrf~^+-')j i = 1, . . . ,n. The requirement on sample size is greatly 
lessened by imposing structure on the arrays. 

Remark 3.2: Although global identifiability is hard to check for a finite sample, a parameter point 
B & B \s asymptotically and globally identifiable as far as it admits a unique decomposition up to 
scaling and permutation and ^"^^(veciCj)(veciCj)^ has full rank for n > tlq, or, when considered 
stochastically, E[(vec JC)(vecX)^] has full rank. To see this, whenever ^"^j^(vecKj)(veciCj)^ has 
full rank, the full coefficient array is globally identifiable and thus the decomposition is identifiable 
whenever it is unique. 

Generalizing the concept of estimable functions for linear models, we call any linear combination 
of {xi, J2r=i f^i o ■ ■ ■ ° f^D '■• i = ^, ■ ■ ■ ,n, as an estimable function. We can estimate estimable or 
collection of estimable functions even when the parameters are not identifiable. 

4.3 Asymptotics 

The asymptotics for tensor regression follow from those for MLE or M-estimation. The key obser- 
vation is that the nonlinear part of tensor model Q is a degree- Z) polynomial of parameters. Then 
the classical Wald's continuity condition and Cramer's smoothness condition become trivial. Our 
proofs in the Appendix are based on the uniform convergence conditions using Glivenko-Cantelli 
theory. For that purpose, note that the collection of polynomials {{B, X), B ^ B} form a Vapnik- 



Cervonenkis (VC) class. Then standard theory for M-estimation (van der Vaart, 1998) applies 



Theorem 1 (Consistency). Assum,e Bq = ^Bqi,. . . , -BodJ £ B is (globally) identifiable up to per- 
mutation and the array covariates Xi are iid from a bounded distribution. The MLE is consistent, 
i.e., Bn converges to Bq (modulo permutation) in probability, in the following models: (1) normal 
tensor regression with a compact parameter space Bq C B; (2) binary tensor regression; and (3) 
poisson tensor regression with a compact parameter space Bq C B. 

Remark ^; (Misspecified Rank) In practice it is rare that the true regression coefficient Btme £ 
^Pix — xpD jg exactly a low rank tensor. However the MLE of the rank-i? tensor model converges 
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to the maximizer of function M{B) = IPst^uc Inps or equivalently ^Btr^c^'^iPB/PBu^c)- In other 
words, the MLE is consistently estimating the best rank-i? approximation of -B^rue in the sense of 
Kullback-Leibler distance. 

To estabhsh the asymptotic normahty of -B„, we note that the log-hkehhood function of tensor 
regression model is quadratic mean differentiable (q.m.d). 

Lemma 3. Tensor regression model is quadratic mean differentiable (q.m.d.). 

The next result follows from the asymptotic normality result for models that satisfy q.m.d. The 
proof is given in the Appendix. 

Theorem 2 (Asymptotic Normality). For an interior point Bq = [[-Bqi, . . . , -BqdI S B with non- 
singular information matrix I{Bqi, . . . , Bqd) ^ and Bn is consistent, 



/n[vec{Bni, ..., Bno) - vec(Soi, . . . , Bqd)] 
converges in distribution to a normal with mean zero and covariance I^^{Bqi, . . . , Bqd). 

5 Regularized Estimation 

The sample size in most neuroimaging studies is quite small, and thus even for a rank-1 tensor 
regression ([3|, it is likely that the number of parameters exceeds the sample size. Therefore, the 
p » n challenge is a rule rather than an exception in neuroimaging analysis, and regularization 
becomes essential. Even when the sample size exceeds the number of parameters, regularization 
is still useful for stabilizing the estimates and improving their risk property. We emphasize that 
there are a large number of regularization techniques for different purposes. Here we illustrate with 
using sparsity regularization for identifying sub-regions that are associated with the response traits. 
This problem can be viewed as an analogue of variable selection in the traditional vector- valued 
covariates. Toward that end, we maximize a regularized log-likelihood function 

D R Pd 
d=l r=l j=l 

where Px{\f3\,p) is a scalar penalty function, p is the penalty tuning parameter, and A is an 



index for the penalty family. Some widely used penalties include: power family (Frank and 
Friedman, 1993), in which Px{\/3\,p) = p\/3\^, X € (0,2], and in particular lasso (Tibshirani 



1996D (A = 1) and ridge (A = 2); elastic net ( |Zou and Hastiej |2005D , in which Pa(|/3|,p) 
/9[(A-l)/372 + (2-A)|/3|],A G [1,2]; and SCAD ( |Fan and Li| |200l[ ), in which (9/a|/3|PA(|/3|,/o) 
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p {l{|/3|<p} + (A/3 — |/3|)_|_/(A — l)pl||^|>p}.|, A > 2, among many others. Choice of penalty function 

and tuning parameters p and A depends on particular purposes: prediction, unbiased estimation, 

or region selection. 

Regularized estimation for tensor models incurs slight changes in Algorithm [T} When updating 

Bd-, we simply fit a penalized GLM regression problem, 

R Pd 
Bf' = argmax,, ^(a(*), 7^^ S^^), . . . , 5^,^ S„ <„ . . . , B^^) - J] Y. ^^(1/3^?^), 

r=l i=l 

for which many software packages exist. Same paradigm certainly applies to regularizations other 
than sparsity. The fitting procedure boils down to alternating regularized GLM regression. The 
monotone ascent property of Algorithm [T] is retained under the modified algorithm, giving rise to 
stability in the estimation algorithm. Convex penalties, such as elastic net and power family with 
A > 1, tend to convexify the objective function and alleviate the local maximum problem. On the 
other hand, concave penalty such as power family with A < 1 and SCAD produces more unbiased 
estimates but the regularized objective function is more ruggy and in practice the algorithm should 
be initialized from multiple start points to increase the chance of finding a global maximum. Many 
methods are available to guide the choice of the tuning parameter p and/or A for regularized GLM, 



notably AIC, BIC and cross validation. For instance the recent work (Zhou et al. , 2011) derives 
BIG type criterion for GLM with possibly non-concave penalties such as power family, which can 
be applied to regularized tensor regression models in a straightforward way. 

Two remarks are in order. First, it is conceptually possible to apply these regularization tech- 
niques directly to the full coefficient array B G IRndP^ without considering any structured decom- 
position as in our models. That is, one simply treats vecX as the predictor vector as employed 
in the classical total variation regularization in image denoising and recovery. However, for the 
brain imaging data, we should bear in mind the dimensionality of the imaging arrays. For instance, 
to the best of our knowledge, no software is able to deal with fused lasso or even simple lasso on 
64^ = 262, 144 or 256^ = 16, 777, 216 variables. This ultrahigh dimensionality certainly corrupts 
the statistical properties of the regularized estimates too. Second, penalization is only one form of 
regularization. In specific applications, prior knowledge often suggests various constraints among 
parameters, which may be exploited to regularize parameter estimate. For instance, for MRI imag- 
ing data, sometimes it may be reasonable to impose symmetry on the parameters along the coronal 
plane, which effectively reduces the dimensionality by pdR/2. In many applications, nonnegativity 
of parameter values is also enforced. 
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6 Numerical Analysis 

We have carried out an extensive numerical study to investigate the finite sample performance 
of the proposed methods. In this section, we report some selected results from some synthetic 
examples and an analysis of a real brain imaging data. 

6.1 2D Shape Examples 



We first elaborate on the illustrative example given in Section 2.2 with a collection of 2D shapes. 
In the first study, we varied the sample size to demonstrate the consistency of tensor regression 
estimation. In the second study, we illustrate regularized tensor regression estimation. 

6.1.1 Tensor Regression Estimation 



We employ the tensor model setup in Section 2.2, where the response is normally distributed with 



mean, ij = ^'''Z + {B,X), and standard deviation one. Here X is a 64 x 64 2D matrix, Z is 
a 5-dimensional covariate vector, both of which have standard normal entries, 7 = (1, 1, 1, 1, 1)^, 
and B is binary with the true signal region equal to one and the rest zero. We examine various 
sample sizes at n = 500, 750 and 1000, and report the results under a tensor model whose rank is 
determined by BIC. Table [T] summarizes the mean results out of 100 data replications. Reported 
criterion is root mean squared error (RMSE) for both the regular vector coefficient 7 and the array 
coefficient B. It is clearly seen that the estimation accuracy increases along with the sample size, 
demonstrating the consistency of the proposed method. 

6.1.2 Regularized Tensor Regression Estimation 



Next we revisit the example in Section |2.2| to illustrate regularized tensor regression estimation. 
The setup is the same as that in Figure [T] except that the sample size is reduced to 500, which 
is only barely larger than the number of parameters 417 = 5 + 3 x (64 + 64) of a rank-3 tensor 
model. Figure [2] shows the outcome of applying the lasso penalty to B^ in the rank-3 tensor 
regression model. Recovered signals at three different values of A = 0, 100, 1000 are displayed. 
Without regularization (A = 0), the rank-3 tensor regression is difficult to recover some signals 
such as triangle, disk and butterfly, mainly due to a very small sample size. On the other hand, 
excessive penalization compromises the quality of recovered signals too, as evidently in those shapes 
at A = 1, 000. Regularized estimation with an appropriate amount of shrinkage improves estimation 
quality, as seen in the triangle and disk at A = 100 and in butterfly at A = 1000. In practice the 
tuning parameter is chosen by a certain model selection criterion such as BIC or cross validation. 
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Table 1: Tensor regression estimation for the 2D shape examples. Reported are mean RMSE and 
its standard deviation (in parenthesis) based on 100 data rephcations. 



Shape 


Par am. 


n = 


500 


n = 


750 


n = 


1000 


Square 


7 
B 


0.0486 
0.0091 


;0.0173) 

;o.ooo7) 


0.0387 
0.0071 


'0.0129) 

;o.ooo5) 


0.0325 
0.0059 


0.0104) 
0.0004) 


T-shape 


7 
B 


0.0612 
0.0160 


;0.0209) 

;o.ooio) 


0.0423 
0.0113 


;0.0135) 

;o.ooo6) 


0.0356 
0.0091 


0.0112) 
0.0005) 


Cross 


7 
B 


0.0610 
0.0159 


;0.0199) 

;o.ooii) 


0.0425 
0.0112 


;o.oi4o) 

'0.0006) 


0.0345 
0.0090 


0.0104) 
0.0005) 


Disk 


7 
B 


0.5702 
0.2125 


;0.1793) 
;0.0218) 


0.1804 
0.0765 


'0.0588) 
'0.0050) 


0.1263 
0.0622 


0.0402) 
0.0017) 


Triangle 


7 
B 


0.6327 
0.2343 


;0.2337) 
;0.0254) 


0.2111 
0.0992 


'0.0752) 
;0.0066) 


0.1491 
0.0775 


0.0541) 
0.0021) 


Butterfly 


7 
B 


1.4385 
0.5536 


;0.5561) 
;0.0570) 


0.5870 
0.2669 


;0.1639) 
'0.0193) 


0.3884 
0.1998 


0.1310) 
0.0071) 



Moreover, we have experimented with the bridge and SCAD penalties for the same data and 
obtained similar results. The desirable unbiased (or nearly unbiased) estimates from those concave 
penalties are reflected by the improved contrast in the recovered signal. But for the sake of space, 
we do not show those flgures here. 

6.2 Attention Deficit Hyperactivity Disorder Data Analysis 

We applied our methods to the attention deficit hyperactivity disorder (ADHD) data from the 
ADHD-200 Sample Inititive (http://fcon_1000.projects.nitrc.org/indi/adhd200/). ADHD is a com- 
mon childhood disorder and can continue through adolescence and adulthood. Symptoms include 
difficulty in staying focused and paying attention, difficulty in controlling behavior, and over- 
activity. The data set that we used is part of the ADHD-200 Global Competition data sets. It 
consists of 776 subjects, with 491 normal controls and 285 combined ADHD subjects. Among them, 
there are 442 males whose mean age is 11.9815 years with standard deviation 3.1355 years, and 
287 females whose mean age is 11.8617 years with standard deviation 3.4875 years. We removed 
47 subjects due to the missing observations or poor image quality. Rs-fMRIs and Tl-weighted 
images were acquired for each subject. The Tl-weighted images were preprocessed by standard 
steps including AC (anterior commissure) and PC (posterior commissure) correction, N2 bias field 
correction, skull-stripping, intensity inhomogeneity correction, cerebellum removal, segmentation, 
and registration. After segmentation, the brains were segmented into four different tissues: grey 
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Figure 2: Demonstration of lasso regularization. The matrix variate has size 64 by 64 with entries 
generated as independent standard normals. The regression coefficient for each entry is either 
(white) or 1 (black). The sample size is 500. 
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matter (GM), white matter (WM), ventricle (VN), and cerebrospinal fluid (CSF). We quantified 



the local volumetric group difi'erences by generating RAVENS maps (Davatzikos et al. , 2001) for 
the whole brain and each of the segmented tissue type (GM, WM, VN, and CSF) respectively, 
using the deformation field we obtained during registration. RAVENS methodology is based on 
a volume-preserving spatial transformation, which ensures that no volumetric information is lost 
during the process of spatial normalization, since this process changes an individual's brain mor- 
phology to conform it to the morphology of a template. In addition to image covariates, we include 
the subjects' age, gender, and whole brain volume as regular covariates. One scientific question 
of interest is to understand association between the disease outcome and the brain image patterns 
after adjustment for the clinical and demographical variables. First, we examined the case with real 
image covariates and simulated responses. Our goal is to study the empirical performance of our 
methods under various response models. Secondly, we showed the performance of the regularized 
estimation in terms of region selection. Finally, we applied the method to the data with the true 
observed binary response. 

6.2.1 Real Image Covariates and Simulated Response 

We first consider a number of GLMs with the real brain image covariates, where r] = ■j^ Z + {B, X), 
the signal tensor B admits a certain structure, 7 = (1,1,1)^, X denotes the 3D MRI image with 
dimension 256 x 256 x 198, and Z denotes the vector of age, gender and whole brain volume. We 
consider two structures for B. The first admits a rank one decomposition, with Bi G IR^'^^^^, 
-B2 e IR^'^''^-', and B3 G IR-'^^^^, and aU of whose (90 + j)th element equal to sin(J7r/14) for 
j = 0, 1, . . . , 14. This corresponds to a single-ball signal in a 3D space. The second admits a 
rank two decomposition, with Bi € IR^^^^^, B2 G IR^^^^^, and S3 G IR^^^^'^. Ah the first 
columns of B^ have their (90 + j)th element equal to sin(j7r/14), and the second columns of 
Bd have their (140 + j)th element equal to sin(j7r/14) for j = 0, 1, ... , 14. This mimics a two- 
ball signal in the 3D space. We then generate the response through the GLM models: for the 
normal model, Y ~ Normal(/i, 1), where fj, = rj; for the binomial model, Y ~ Bernoulli(p), with 
p = 1/[1 + exp(— 0.1?])]; and for the poisson model, Y ~ Poission(;u), with /i = exp(0.0l7?). Table[2] 
summarizes the average RMSE and its standard deviation out of 100 data replications. We see 
that the normal and poisson responses both have competitive performance, whereas the binomial 
case is relatively more challenging. The two-ball signal is more challenging than a one-ball signal, 
and overall the tensor models work well across different response types and different signals. 
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Table 2: Tensor regression estimation for the ADHD data. Reported are mean RMSE and its 
standard deviation (in parenthesis) of evaluation criteria based on 100 data rephcations. 



Signal 


Par am. 


Normal 


Binomial 


Poisson 


one-ball 


7 
B 


0.0639 (0.0290) 
0.0039 (0.0002) 


0.2116 (0.0959) 
0.0065 (0.0002) 


0.0577 (0.0305) 
0.0064 (0.0002) 


two-ball 


1 
B 


0.0711 (0.0310) 
0.0058 (0.0002) 


0.3119 (0.1586) 
0.0082 (0.0003) 


0.0711 (0.0307) 
0.0083 (0.0003) 



6.2.2 Regularized Estimation 

Next we focus on the ability of the proposed regularized tensor regression model to identify relevant 
regions in brain associated with the response. This problem is an analogue of variable selection 
in the traditional regression with vector- valued covariates. We employ the two-ball signal and the 
normal model in Section 6.2.1 Figure |3] shows images with the true signal, the un-regularized 



tensor regression estimate, and the regularized tensor regression estimates with a lasso penalty, 
respectively, overlaid on an image of an arbitrarily chosen subject, or on a 3D rendering of a 
template. The plots clearly show that the true sparse signal regions can be well recovered through 
regularization. 

6.2.3 Real Data Analysis 

Finally, we analyze the ADHD data with the observed binary diagnosis status as the response. 
We fitted a rank-3 tensor logistic regression model, since in practice it is rare that the true signal 
would follow an exact reduced rank formulation. We also applied the regularized estimation using 
a lasso penalty. Figure |4] shows the results. Inspecting Figure |4] reveals two regions of interest: left 
temporal lobe white matter and the splenium that connects parietal and occipital cortices across 
the midline in the corpus callosum. The anatomical disturbance in the temporal lobe has been 
consistently revealed and its interpretation would be consistent with a finer-grained analysis of 
the morphological features of the cortical surface, which reported prominent volume reductions in 



the temporal and frontal cortices in children with ADHD compared with matched controls (Sowell 



et al. 2003). Moreover, a reduced size of the splenium is the most reliable finding in the corpus 



callosum (Valera et al. 2007). 
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True Signal (a 




Figure 3: Region selection. The true signal regions are colored in red, the estimated signal regions 
are in green, and the overlapped regions are in yellow. The left panel is the true or estimated 
signal overlaid on a randomly selected subject, and the right panel is a 3D rendering of the true or 
estimated signal overlaid on the template 
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Figure 4: Application to the ADHD data. Panel (a) is the unpenalized estimate overlaid on a 
randomly selected subject; (b) is the regularized estimate overlaid on a randomly selected subject; 
(c) is a selected slice of the regularized estimate overlaid on the template; and (d) is a 3D rendering 
of the regularized estimate overlaid on the template. 



7 Discussion 

We have proposed a tensor decomposition based approach for regression modeling with array co- 
variates. The curse of dimensionality is circumvented by imposing a low rank approximation to 
the extremely high-dimensional full coefficient array. This allows development of a fast estimation 
algorithm and regularization. Numerical analysis demonstrates that, despite its massive reduction, 
the method works very well in recovering various geometric as well as natural shape images. 

We view the method of this article a first step toward a more general area of array regression 
analysis, and the idea can be extended to a wide range of problems. We describe a few poten- 
tial future directions here. First, although we only present results for models with a conventional 
covariate vector and an array covariate, the framework applies to arbitrary combination of array 
covariates. This provides a promising approach to the analysis of multi-modality data which be- 
comes increasingly available in modern neuroimaging and medical studies. Besides the main effects 
of different array covariates, the interaction between them will be of interest, and can be studied 
under this framework too. Second, we remark that our modeling approach and algorithm equally 
apply to many general loss functions occurring in classification and prediction. For example, for a 
binary response Y S {0, 1}, the hinge loss takes the form 

n _R 

i=l r=l 

and should play an important role in support vector machines with array variates. Third, in this 
article rotation has not been explicitly considered in the modeling. When prior knowledge indicates, 
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sometimes it is prudent to work in polar coordinates. For example, the 'disk' signal in Figure [T] 
can be effectively captured by a rank-1 outer product if the image is coded in polar coordinates. 
A diagonal signal array has full rank and cannot be approximated by any lower rank array, but 
if changed to polar coordinates, the rank reduces to one. Some of these extensions are currently 
under investigation. In summary, we believe that the proposed methodology timely answers calls 
in modern neuroimaging data analysis, whereas the general methodology of tensor regression is 
to play a useful role and also deserves more attention in statistical analysis of high-dimensional 
complex imaging data. 
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